Comparing Breast Cancer Prediction Models

Authors: Ritik Panchal, Prince Kumar

DOI Link: https://doi.org/10.22214/ijraset.2024.59447

Abstract

In this research study, five machine learning algorithms—Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree (C4.5), and K-Nearest Neighbors (KNN)—were applied to the Breast Cancer Wisconsin Diagnostic dataset. The subsequent results underwent a thorough performance evaluation and comparison among these diverse classifiers. The primary objective was to predict and diagnose breast cancer using machine learning algorithms, determining the most effective approach based on factors such as the confusion matrix, accuracy, and precision. Notably, the findings highlight that the Support Vector Machine outperformed all other classifiers, achieving the highest accuracy at 97.2%.

I. INTRODUCTION

Breast cancer, a complex and heterogeneous disease, remains a major global health challenge with significant implications for women's well-being. Early detection is paramount for successful intervention and improved patient outcomes [1]. While traditional screening methods have played a crucial role, recent advances in artificial intelligence (AI) and machine learning (ML) offer new opportunities to improve the accuracy and precision of breast cancer prediction [2].

According to data released by the International Agency for Research on Cancer (IARC) in December 2020, breast cancer has taken over as the most commonly diagnosed cancer in women globally, surpassing lung cancer. Over the past two decades, the overall number of cancer cases has almost doubled, escalating from an estimated 10 million in 2000 to 19.3 million in 2020. [1]

Presently, one in every five people worldwide is expected to face a cancer diagnosis during their lifetime. Projections indicate a nearly 50% increase in cancer diagnoses by 2040 compared with 2020. The number of deaths attributable to cancer has also risen, from 6.2 million in 2000 to 10 million in 2020, and more than one in six deaths globally is now linked to cancer. These trends underscore the ongoing global impact of cancer.

AI and ML allow a more nuanced understanding of the factors contributing to breast cancer risk. These technologies can analyze large datasets and identify subtle patterns and interactions that traditional methods may miss [9]. In addition, a predictive model can improve over time as it learns from new data, contributing to ongoing advances in breast cancer prediction.

Moreover, the integration of genetic information enables a deeper exploration of inherited risk factors, paving the way for a more comprehensive understanding of an individual's predisposition to breast cancer. By considering lifestyle factors alongside clinical and genetic data, the model aims to provide a holistic view of risk, contributing to more effective and personalized preventive measures. [1]

II. LITERATURE SURVEY

The literature on breast cancer prediction highlights the urgent need for accurate and early identification of this pervasive global health challenge. Researchers use machine learning algorithms, such as Support Vector Machines (SVM) and Random Forests, together with datasets like the Breast Cancer Wisconsin Diagnostic dataset, to develop predictive models. Performance metrics including accuracy, precision, sensitivity, specificity, F1 score, and area under the ROC curve (AUC) are commonly used for model evaluation. Although significant progress has been made, the literature continues to call for further research, validation, and broader application across diverse populations. Future directions include advances in algorithmic techniques, integration of imaging data such as mammograms, and attention to ethical considerations. A holistic approach that combines machine learning algorithms with clinical expertise is advocated to make breast cancer prediction models more effective and to improve patient outcomes.

III. METHODOLOGY

The primary goal is to predict and diagnose breast cancer using machine learning algorithms and to identify the most effective classifier based on key performance metrics, including the confusion matrix, accuracy, precision, and sensitivity. To this end, five classifiers, namely Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree (C4.5), and K-Nearest Neighbors (KNN), were applied to the Breast Cancer Wisconsin Diagnostic dataset. The results were then evaluated to determine which model provides the highest accuracy in breast cancer prediction.
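The comparison described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the 75/25 stratified split, the fixed random seed, the feature scaling, and all hyperparameters are assumptions. scikit-learn's decision tree implements CART rather than C4.5, so the entropy criterion is used only as an approximation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# scikit-learn's bundled copy of the Wisconsin Diagnostic data
# (labels: 0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

# Assumed split: 75% train / 25% test, stratified, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default hyperparameters throughout (an assumption, not the paper's settings).
classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # CART with the entropy criterion, standing in for C4.5.
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

results = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Malignant (label 0) is treated as the positive class.
        "precision": precision_score(y_test, y_pred, pos_label=0),
        "sensitivity": recall_score(y_test, y_pred, pos_label=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

The figures this prints depend on the split and settings chosen, so they should not be expected to reproduce the values reported in this paper.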

A. Dataset Description

Name: Wisconsin Breast Cancer Diagnostic Dataset (WBCD)
Dataset Link: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
Size: 50 KB
Attributes:-
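As a quick check of the dataset's shape, the copy of the Wisconsin Diagnostic data bundled with scikit-learn (assumed here to match the UCI file linked above) can be loaded and inspected. This is a minimal sketch, not the authors' loading code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Number of samples (rows) and numeric attributes (columns).
n_samples, n_features = data.data.shape

# Count samples per class; target_names maps label 0/1 to a class name.
class_counts = {
    str(name): int(count)
    for name, count in zip(data.target_names, np.bincount(data.target))
}

print(n_samples, n_features)
print(class_counts)
print([str(name) for name in data.feature_names[:5]])
```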

IV. COMPARING MODELS

This section compares the performance of five classifiers: Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN). These classifiers are widely used in the research community, and several of them (SVM, C4.5, and KNN) appear on commonly cited lists of the top 10 data mining algorithms. As in the methodology, the classifiers are compared on key performance metrics derived from the confusion matrix: accuracy, precision, and sensitivity.
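For reference, the metrics named above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. The function name `binary_metrics` and the counts passed to it are hypothetical and are not results from this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives,
    false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
        "sensitivity": tp / (tp + fn),   # share of actual positives that are found
    }


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=50, fp=3, fn=2, tn=88)
for metric, value in example.items():
    print(metric, round(value, 3))
```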

A. K-Nearest Neighbors (KNN)

Role: K-Nearest Neighbors (KNN) evaluates the similarity between instances to assess the potential presence of cancer. The algorithm rests on the premise that instances with similar features are likely to have similar outcomes. In breast cancer prediction, KNN classifies a new data point according to its proximity to existing instances in the feature space.
Process: KNN assigns a test sample to the majority class among its K nearest neighbors, where the parameter "K" is the number of neighbors considered [14]. Choosing K involves a trade-off: smaller values make the classifier more sensitive to noise, while larger values produce a smoother decision boundary. The implementation proceeds step by step: splitting the dataset into features and labels, dividing the data into training and testing sets, building the predictive model, performing cross-validation, and finding the optimal number of neighbors K.
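The steps listed above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 75/25 split, the 10-fold cross-validation, the candidate range of odd K values from 1 to 29, and the feature scaling are all choices made here for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: split the dataset into features (X) and labels (y).
X, y = load_breast_cancer(return_X_y=True)

# Step 2: divide the data into training and testing sets (assumed 75/25).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 3-4: build a model for each candidate K and cross-validate it
# on the training set only (assumed 10 folds, odd K to avoid ties).
cv_scores = {}
for k in range(1, 31, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=10).mean()

# Step 5: pick the K with the best mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)

# Refit on the full training set and evaluate once on the held-out test set.
final_model = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=best_k)
)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(best_k, round(test_accuracy, 3))
```

Selecting K on the training folds and touching the test set only once keeps the reported test accuracy from being biased by the tuning step.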
Result

TABLE I
Result of K-Nearest Neighbors (K-NN)

Result	Precision	Sensitivity	F-Measure
Benign	0.92	0.91	0.91
Malignant	0.95	0.96	0.95
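The F-Measure column in Table I can be checked against the usual definition of the F-measure as the harmonic mean of precision and sensitivity. The short sketch below assumes that this is the definition used. It recomputes the value for each class from the table's precision and sensitivity, and both results agree with the reported figures to two decimal places.

```python
def f_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (the F1 score)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# (precision, sensitivity, reported F-measure) taken from Table I.
table_i = {
    "Benign": (0.92, 0.91, 0.91),
    "Malignant": (0.95, 0.96, 0.95),
}

for label, (precision, sensitivity, reported) in table_i.items():
    computed = f_measure(precision, sensitivity)
    print(label, round(computed, 2), reported)
```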

VI. ACKNOWLEDGMENT

We would like to express our profound gratitude to Assistant Professor Ms. Priya Singh of the Department of Software Engineering at Delhi Technological University for her invaluable guidance, insightful feedback, and unwavering support throughout this research project. Her expertise and dedication not only shaped this work but also inspired us to explore our potential. We are truly grateful for her mentorship and for the opportunity to work under her guidance. This project would not have come to fruition without her encouragement and constructive criticism.

Conclusion

We applied five algorithms (SVM, Random Forest, Logistic Regression, Decision Tree, and KNN) to the Wisconsin Breast Cancer Diagnostic dataset (WBCD) and compared their results using the confusion matrix, accuracy, sensitivity, precision, and AUC in order to identify the most precise and reliable algorithm. All algorithms were implemented in Python using the scikit-learn library in an Anaconda environment. In this comparison, Support Vector Machine outperformed all other algorithms, achieving an accuracy of 97.2%, a precision of 97.5%, and an AUC of 96.6%. We conclude that Support Vector Machine is effective for breast cancer prediction and diagnosis and gives the best performance in terms of accuracy and precision.

All results reported here were obtained on the WBCD dataset alone, which is a limitation of this work. Future work should apply the same algorithms and methods to other datasets to confirm these findings. We also plan to apply these and other machine learning algorithms, with new parameters, to larger datasets containing more disease classes in order to obtain higher accuracy.

References

[1] R. K. Barwal and N. Raheja, "A Classification System for Breast Cancer Prediction using SVOF-KNN method," 2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), Trichy, India, 2022.
[2] M. P. Behera, A. Sarangi, D. Mishra and S. K. Sarangi, "Breast Cancer Prediction Using Long Short-Term Memory Algorithm," 2022 5th International Conference on Computational Intelligence and Networks (CINE), Bhubaneswar, India, 2022.
[3] Y. Wankhade, S. Toutam, K. Thakre, K. Kalbande and P. Thakre, "Machine Learning Approach for Breast Cancer Prediction: A Review," 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 2023.
[4] M. R. Karim, G. Wicaksono, I. G. Costa, S. Decker and O. Beyan, "Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data," in IEEE Access, vol. 7.
[5] N. Arya and S. Saha, "Multi-Modal Classification for Human Breast Cancer Prognosis Prediction: Proposal of Deep-Learning Based Stacked Ensemble Model," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 19, no. 2, pp. 1032-1041, 1 March-April 2022.
[6] S. Alghunaim and H. H. Al-Baity, "On the Scalability of Machine-Learning Algorithms for Breast Cancer Prediction in Big Data Context," in IEEE Access, vol. 7, pp. 91535-91546, 2019.
[7] X. Wang, W. Yu, Z. Ding, X. Zhai and S. Saha, "Modeling and Analyzing of Breast Tumor Deterioration Process with Petri Nets and Logistic Regression," in Complex System Modeling and Simulation, vol. 2, no. 3, pp. 264-272, September 2022.
[8] M. Byra, K. Dobruch-Sobczak, Z. Klimonda, H. Piotrzkowska-Wroblewska and J. Litniewski, "Early Prediction of Response to Neoadjuvant Chemotherapy in Breast Cancer Sonography Using Siamese Convolutional Neural Networks," in IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 3, pp. 797-805, March 2021.
[9] M. R. Karim, G. Wicaksono, I. G. Costa, S. Decker and O. Beyan, "Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data," in IEEE Access, vol. 7, pp. 133850-133864, 2019.
[10] Z. Huang and D. Chen, "A Breast Cancer Diagnosis Method Based on VIM Feature Selection and Hierarchical Clustering Random Forest Algorithm," in IEEE Access, vol. 10, pp. 3284-3293, 2022.
[11] E. K. Jadoon, F. G. Khan, S. Shah, A. Khan and M. ElAffendi, "Deep Learning-Based Multi-Modal Ensemble Classification Approach for Human Breast Cancer Prognosis," in IEEE Access, vol. 11.
[12] C. McIntosh and T. G. Purdie, "Contextual Atlas Regression Forests: Multiple-Atlas-Based Automated Dose Prediction in Radiation Therapy," in IEEE Transactions on Medical Imaging, vol. 35, no. 4.
[13] G. Sruthi, C. L. Ram, M. K. Sai, B. P. Singh, N. Majhotra and N. Sharma, "Cancer Prediction using Machine Learning," 2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM), Gautam Buddha Nagar, India, 2022.
[14] Y. Wankhade, S. Toutam, K. Thakre, K. Kalbande and P. Thakre, "Machine Learning Approach for Breast Cancer Prediction: A Review," 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 2023.
[15] A. Bharat, N. Pooja and R. A. Reddy, "Using Machine Learning algorithms for breast cancer risk prediction and diagnosis," 2018 3rd International Conference on Circuits, Control, Communication and Computing (I4C), Bangalore, India, 2018.
[16] M. Sugimoto, M. Takada and M. Toi, "Comparison of robustness against missing values of alternative decision tree and multiple logistic regression for predicting clinical data in primary breast cancer," 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 2013.
[17] M. Akhil and P. V. S. Kumar, "Breast Cancer Prognosis using Machine Learning Applications," 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, 2022.

Copyright

Copyright © 2024 Ritik Panchal, Prince Kumar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET59447

Publish Date : 2024-03-26

ISSN : 2321-9653

Publisher Name : IJRASET
